Abstract. We use mathematical methods from the theory of tailored random graphs to study 
systematically the effects of sampling on topological features of large biological signalling networks. 
Our aim in doing so is to increase our quantitative understanding of the relation between true 
biological networks and the imperfect and often biased samples of these networks that are reported 
in public data repositories and used by biomedical scientists. We derive exact explicit formulae 
for degree distributions and degree correlation kernels of sampled networks, in terms of the degree 
distributions and degree correlation kernels of the underlying true network, for a broad family of 
sampling protocols that include (un-)biased node and/or link undersampling as well as (un-)biased 
link oversampling. Our predictions are in excellent agreement with numerical simulations. 

1. Introduction 



Networks are popular simplified representations of complex biological many-variable systems. The 
network representation reduces the complexity of the problem by retaining only information on 
which pairs of dynamical variables in a given system interact, leading to a graph in which the 
nodes (or vertices) represent the dynamical variables and the links (or edges) represent interacting 
pairs. If all interactions are symmetric under interchanging the two variables concerned, the 
resulting network is nondirected (as e.g. in protein-protein interaction networks). If some or all 
are nonsymmetric, the network is directed (as e.g. in gene regulation networks). Present-day 
biological databases contain protein-protein interaction networks and gene regulation networks 
of many species, with typically in the order of ~ 10^ — 10^ nodes each, measured and 
post-processed by various different techniques and protocols. However, in biology the available 
experimental techniques do not sample the complete system, but only a finite fraction; for the 
human protein-protein interaction network this fraction is believed to be presently around 0.5 
[1]. Furthermore, the sampling tends to be biased by which experimental method is used [2]. In 
order to use the available data wisely and reliably it is vital that we understand in quantitative 
detail how the topological characteristics of a real network relate to those of a finite (biased 
or unbiased) random sample of this network. If, for instance, we observe that certain modules 
appear more often (or less often) than expected in certain cellular signalling networks, we need 
to be sure that this is not simply a consequence of imperfect sampling. The first studies of 
the effects of undersampling on network topologies focused on the relation between true and 
observed degree distributions, either analytically [3, 4] or via numerical simulation [5], and found 
that undersampling changes qualitatively the shape of the degree distribution. Subsequent studies 
[6, 7], based on numerical simulation, revealed the effects of undersampling on topological features 
other than the degree distribution, such as clustering coefficients, assortativity, and the occurrence 
frequencies of local motifs. More recent publications were devoted to sampling of non-biological 
networks, such as the internet [8] and bipartite networks [9]. So far all published studies on the 
effects of sampling have either been based on numerical simulations, or been restricted to the 
effects of sampling on a network's degree distribution. Moreover, there are only very few studies 
that considered biased sampling, and none that investigate oversampling. In the present 
paper we use statistical mechanical methods from the theory of tailored random graphs to study 
systematically the effects of sampling on macroscopic topological features of large networks. We 
extend previous work in several ways. Firstly, we investigate the effect of sampling on macroscopic 
observables beyond the degree distribution, e.g. the joint degree distribution of connected node 
pairs from which one calculates quantities such as the assortativity. Secondly, we do this for 
both unbiased and biased sampling of either nodes, links or both. Thirdly, we investigate not 
only network undersampling, i.e. the implications of false negatives in the detection of links 
and/or nodes, but also the effects of oversampling, i.e. the implications of false positives. All 
our results are obtained analytically, and formulated in terms of explicit equations that express 
degree distributions and degree correlation kernels of observed networks in terms of those of the 
underlying true networks. We test our analytical predictions against numerical simulations and 
find excellent agreement. 



2. Definitions 



2.1. Networks and sampling protocols 

We consider non-directed networks or graphs. Each is defined by a symmetric matrix $c = \{c_{ij}\}$, 
with $i, j = 1 \ldots N$ and with $c_{ij} \in \{0, 1\}$ for all $(i, j)$. Nodes $i$ and $j$ are connected if and only 
if $c_{ij} = 1$. We exclude self-interactions, i.e. $c_{ii} = 0$ for all $i$. The degree $k_i(c)$ of a node $i$ is 
$k_i(c) = \sum_j c_{ij}$, the degree distribution of graph $c$ is $p(k|c) = N^{-1} \sum_i \delta_{k,k_i(c)}$, and we abbreviate its 
degree sequence as $\mathbf{k}(c) = (k_1(c), \ldots, k_N(c))$. Sampling stochastically an $N$-node graph $c$ will 
result in observation of an $N'$-node graph $c'$. The relation between $c'$ and $c$ depends on the details 
of the sampling process. We use random variables $\sigma_i \in \{0, 1\}$ to denote whether a true node $i$ 
is observed, and $\tau_{ij} \in \{0, 1\}$ whether a link $(i,j)$ is observed (if nodes $i$ and $j$ are). In studying 
oversampling $\lambda_{ij} \in \{0, 1\}$ will indicate whether an absent link is falsely reported as present. Thus: 

\[
\begin{aligned}
\text{node undersampling:} \quad & c'_{ij} = \sigma_i \sigma_j c_{ij} \\
\text{bond undersampling:} \quad & c'_{ij} = \tau_{ij} c_{ij} \\
\text{node and bond undersampling:} \quad & c'_{ij} = \sigma_i \sigma_j \tau_{ij} c_{ij} \\
\text{bond oversampling:} \quad & c'_{ij} = c_{ij} + (1 - c_{ij}) \lambda_{ij}
\end{aligned}
\tag{1}
\]

In a biological context, node oversampling (e.g. detecting a nonexistent protein) would be 
unrealistic, so will not be considered in this paper. Note that $N' = \sum_{i \leq N} \sigma_i$. We take all 
sampling variables $\sigma = \{\sigma_i\}$, $\tau = \{\tau_{ij}\}$ and $\lambda = \{\lambda_{ij}\}$ to be distributed independently, with the 
proviso that $\tau_{ij} = \tau_{ji}$, $\lambda_{ij} = \lambda_{ji}$ and $\lambda_{ii} = 0$ (so sampled networks remain nondirected and without 
self-interactions). In unbiased sampling their probabilities are independent of the site indices; in 
biased sampling the probabilities will depend on the degrees of the nodes involved. We conclude 
that the different types of sampling under (1) are all special cases of the following unified process: 

\[
c'_{ij} = \sigma_i \sigma_j \big[ \tau_{ij} c_{ij} + (1 - c_{ij}) \lambda_{ij} \big] \qquad \forall (i < j)
\tag{2}
\]

with 

\[
\begin{aligned}
P(\sigma, \tau, \lambda | x, y, z) = {} & \prod_i \Big[ x(k_i) \delta_{\sigma_i,1} + (1 - x(k_i)) \delta_{\sigma_i,0} \Big] \cdot \prod_{i<j} \Big[ y(k_i,k_j) \delta_{\tau_{ij},1} + (1 - y(k_i,k_j)) \delta_{\tau_{ij},0} \Big] \\
& \times \prod_{i<j} \Big[ \frac{z(k_i,k_j)}{N} \delta_{\lambda_{ij},1} + \Big( 1 - \frac{z(k_i,k_j)}{N} \Big) \delta_{\lambda_{ij},0} \Big]
\end{aligned}
\tag{3}
\]

Here $x(k) \in [0, 1]$ gives the likelihood that a node with degree $k$ will be detected, $y(k, k') \in 
[0, 1]$ the likelihood that a link connecting nodes with degrees $(k, k')$ will be detected, and 
$z(k,k')/N \in [0,1]$ the likelihood that an absent bond will be falsely reported as present (the 
latter scales as $N^{-1}$ to retain finite connectivity for large $N$). For unbiased sampling the 
control parameters in (3) would all be degree-independent, i.e. $x(k) = x$, $y(k, k') = y$ and 
$z(k, k') = z$. We note that, since nonexisting nodes cannot give false negatives, we may always 
choose $x(0) = y(0, k) = y(k, 0) = 0$ for all $k$. Typical choices for biased sampling would be 
$x(k) = k/k_{\max}$ and $y(k,k') = z(k,k') = kk'/k_{\max}^2$, i.e. high-degree nodes and links connecting 
high-degree nodes are more likely to be reported. 
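
As an illustration of the unified protocol (2) with the sampling statistics (3), the following is a minimal sketch in plain Python. The function names `degrees` and `sample_network`, the list-of-lists representation of $c$, and the convention that unobserved nodes keep an all-zero row are assumptions made here for illustration; they are not taken from the paper.

```python
import random


def degrees(c):
    """Degrees k_i = sum_j c_ij of a symmetric 0/1 adjacency matrix (list of lists)."""
    return [sum(row) for row in c]


def sample_network(c, x, y, z, rng=random):
    """Apply protocol (2): c'_ij = s_i s_j [t_ij c_ij + (1 - c_ij) l_ij] for i < j.

    x(k), y(k, k') and z(k, k') are callables as in (3); z is divided by N here.
    Returns (c_prime, sigma). Unobserved nodes (sigma_i = 0) keep an all-zero row.
    """
    n = len(c)
    k = degrees(c)
    # sigma_i = 1 with probability x(k_i)
    sigma = [1 if rng.random() < x(k[i]) else 0 for i in range(n)]
    c_prime = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if not (sigma[i] and sigma[j]):
                continue
            if c[i][j] == 1:
                keep = rng.random() < y(k[i], k[j])      # tau_ij: true link detected
            else:
                keep = rng.random() < z(k[i], k[j]) / n  # lambda_ij: false positive
            if keep:
                c_prime[i][j] = c_prime[j][i] = 1
    return c_prime, sigma
```

Unbiased sampling corresponds to constant callables, e.g. `sample_network(c, lambda k: 0.7, lambda k, kp: 1.0, lambda k, kp: 0.0)` for pure node undersampling with $x = 0.7$.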

2.2. Macroscopic characterisation of network structure 

To control analytically the topological properties of the networks to which our sampling protocols 
(1) are applied, we consider the following maximum entropy ensemble of typical graphs with 
prescribed degrees and prescribed degree correlations: 

\[
p(c) = \frac{1}{Z_N} \Big[ \prod_i \delta_{k_i, k_i(c)} \Big] \prod_{i<j} \Big[ \frac{\bar{k}}{N} \frac{W(k_i,k_j)}{p(k_i) p(k_j)} \delta_{c_{ij},1} + \Big( 1 - \frac{\bar{k}}{N} \frac{W(k_i,k_j)}{p(k_i) p(k_j)} \Big) \delta_{c_{ij},0} \Big]
\tag{4}
\]

with $p(k) = N^{-1} \sum_i \delta_{k,k_i}$ and $\bar{k} = \sum_k p(k) k$, and with $Z_N$ the appropriate normalisation constant. 
Graphs generated according to (4) will have $\mathbf{k}(c) = \mathbf{k}$, $p(k|c) = p(k)$, and $\sum_c p(c) W(k, k'|c) = 
W(k, k')$, where $W(k, k'|c) = (N\bar{k})^{-1} \sum_{ij} c_{ij} \delta_{k,k_i} \delta_{k',k_j}$ is the joint degree distribution of connected 
node pairs. Apart from the information in $\mathbf{k}$ and $W(k,k')$, the ensemble (4) is unbiased; see 
[10] for derivations of its information-theoretic properties, and [11] for MCMC algorithms via 
which its graphs can be generated numerically. The remainder of this paper is devoted to 
calculating analytically how in typical large networks, with given degree sequences and given 
degree correlations (i.e. those generated via (4)), sampling affects the macroscopic topological 
characteristics $p(k)$ and $W(k,k')$. To be specific, we calculate the following quantities in terms 
of the sampling characteristics $\{x(k), y(k, k'), z(k, k')\}$: 

\[
\bar{k}(x,y,z) = \lim_{N\to\infty} \sum_c p(c) \Big\langle \frac{\sum_{ij} c'_{ij}}{\sum_i \sigma_i} \Big\rangle_{\sigma,\tau,\lambda}
\tag{5}
\]

\[
p(k|x,y,z) = \lim_{N\to\infty} \sum_c p(c) \Big\langle \frac{\sum_i \sigma_i \delta_{k, \sum_j c'_{ij}}}{\sum_i \sigma_i} \Big\rangle_{\sigma,\tau,\lambda}
\tag{6}
\]

\[
W(k,k'|x,y,z) = \lim_{N\to\infty} \sum_c p(c) \Big\langle \frac{\sum_{ij} c'_{ij} \delta_{k, \sum_\ell c'_{i\ell}} \delta_{k', \sum_\ell c'_{j\ell}}}{\sum_{ij} c'_{ij}} \Big\rangle_{\sigma,\tau,\lambda}
\tag{7}
\]

with $c'_{ij}$ as defined in (2). The denominators are simplified trivially, using the independence of 
the sampling variables and the definition of $W(k, k'|c)$, since 

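\[
\frac{1}{N} \sum_i \sigma_i = \frac{1}{N} \sum_i x(k_i) + \mathcal{O}(N^{-1/2}) = \sum_k p(k) x(k) + \mathcal{O}(N^{-1/2})
\tag{8}
\]

\[
\begin{aligned}
\frac{1}{N} \sum_{ij} c'_{ij} &= \frac{1}{N} \sum_{ij} x(k_i) x(k_j) \Big[ \frac{z(k_i,k_j)}{N} + c_{ij} \Big( y(k_i,k_j) - \frac{z(k_i,k_j)}{N} \Big) \Big] + \mathcal{O}(N^{-1/2}) \\
&= \sum_{kk'} x(k) x(k') \big\{ p(k) p(k') z(k,k') + \bar{k}\, W(k,k') y(k,k') \big\} + \mathcal{O}(N^{-1/2})
\end{aligned}
\tag{9}
\]

We may therefore write 

\[
\bar{k}(x,y,z) = \frac{\sum_{qq'} x(q) x(q') \big[ p(q) p(q') z(q,q') + \bar{k}\, W(q,q') y(q,q') \big]}{\sum_q p(q) x(q)}
\tag{10}
\]

\[
p(k|x,y,z) = \lim_{N\to\infty} \frac{\sum_c p(c) \big\langle \frac{1}{N} \sum_i \sigma_i \delta_{k, \sum_j c'_{ij}} \big\rangle_{\sigma,\tau,\lambda}}{\sum_q p(q) x(q)}
\tag{11}
\]

\[
W(k,k'|x,y,z) = \lim_{N\to\infty} \frac{\sum_c p(c) \big\langle \frac{1}{N} \sum_{ij} c'_{ij} \delta_{k, \sum_\ell c'_{i\ell}} \delta_{k', \sum_\ell c'_{j\ell}} \big\rangle_{\sigma,\tau,\lambda}}{\bar{k}(x,y,z) \sum_q p(q) x(q)}
\tag{12}
\]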
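
For comparison with simulations one needs empirical estimates of the two observables from a given adjacency matrix. The sketch below implements the definitions of $p(k|c)$ and $W(k,k'|c)$ from Section 2 in plain Python; the function names and the dictionary return format are assumptions made here for illustration. For a sampled graph one would first restrict the matrix to the observed nodes ($\sigma_i = 1$), a step that is not shown.

```python
from collections import Counter


def degree_distribution(c):
    """p(k|c) = N^{-1} sum_i delta_{k, k_i(c)}, returned as a dict k -> p(k)."""
    n = len(c)
    counts = Counter(sum(row) for row in c)
    return {k: m / n for k, m in counts.items()}


def joint_degree_distribution(c):
    """W(k,k'|c) = (N kbar)^{-1} sum_ij c_ij delta_{k,k_i} delta_{k',k_j}.

    Returned as a dict (k, k') -> W(k, k'); every link is counted in both directions.
    """
    k = [sum(row) for row in c]
    total = sum(k)  # equals N * kbar, i.e. twice the number of links
    w = Counter()
    for i, row in enumerate(c):
        for j, cij in enumerate(row):
            if cij:
                w[(k[i], k[j])] += 1
    return {kk: m / total for kk, m in w.items()}
```

Summing the second dictionary over $k'$ gives the marginal $W(k) = k\,p(k)/\bar{k}$, which is a convenient consistency check on any implementation.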

3. Effects of sampling on degree distributions 

3.1. Connection between observed degree distributions and degree correlations 

We note that in the case of biased sampling the average degree (10) in the observed graph will 
generally depend not only on the degree distribution of the original graph but also on the latter's 
degree correlations. Hence our decision to use the graph ensemble (4) for the present study. 
The observed distributions $p(k|x,y,z)$ and $W(k, k'|x,y,z)$ in (11, 12) are connected via a simple 
identity, as are $p(k)$ and $W(k, k')$ in the original graph $c$: 

\[
\begin{aligned}
W(k|x,y,z) &= \sum_{k'} W(k,k'|x,y,z) \\
&= \lim_{N\to\infty} \frac{\sum_c p(c) \big\langle \frac{1}{N} \sum_{ij} c'_{ij} \delta_{k, \sum_\ell c'_{i\ell}} \big\rangle_{\sigma,\tau,\lambda}}{\bar{k}(x,y,z) \sum_q p(q) x(q)} \\
&= \frac{k}{\bar{k}(x,y,z)}\, p(k|x,y,z)
\end{aligned}
\tag{13}
\]

So for large $N$ we need to calculate in principle only $W(k, k'|x,y,z)$, as $p(k|x,y,z)$ follows via 
(13). Alternatively, since for unbiased sampling $p(k|x,y,z)$ can be found analytically with little 
effort, the identity (13) can be used for verifying the result of our calculation of expression (12). 

3.2. Degree distribution for unbiased sampling 

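Calculating $p(k|x, y, z)$ is only straightforward for unbiased sampling, irrespective of whether the 
source graph is generated according to (4), since in that case (11) can be made to factorize 
over the sampling variables by writing the Kronecker-$\delta$ in integral form. In order to appreciate 
the roles played by the different ingredients of expression (11), we first write it in the form 
$p(k|x,y,z) = \lim_{N\to\infty} \sum_c p(c) P_N(k|x,y,z;c)$, with 

\[
P_N(k|x,y,z;c) = \frac{1}{\sum_q p(q) x(q)} \int_{-\pi}^{\pi} \frac{d\omega}{2\pi}\, e^{i\omega k}\, \frac{1}{N} \sum_i x(k_i) \prod_{j \neq i} \Big\{ 1 + x(k_j) \Big[ c_{ij}\, y(k_i,k_j) + (1 - c_{ij}) \frac{z(k_i,k_j)}{N} \Big] (e^{-i\omega} - 1) \Big\}
\tag{14}
\]

For unbiased sampling protocols, where $x(k) = x$, $y(k, k') = y$ and $z(k, k') = z$, this expression 
immediately simplifies to the transparent result 

\[
\begin{aligned}
P_N(k|x,y,z;c) &= \sum_{k'} p(k'|c) \int_{-\pi}^{\pi} \frac{d\omega}{2\pi}\, e^{i\omega k + xz(e^{-i\omega} - 1)} \big\{ 1 + xy(e^{-i\omega} - 1) \big\}^{k'} + \mathcal{O}(N^{-1}) \\
&= \sum_{k'} p(k'|c) \sum_{n \geq 0} \frac{(xz)^n}{n!} \sum_{m=0}^{k'} \binom{k'}{m} (xy)^m \int_{-\pi}^{\pi} \frac{d\omega}{2\pi}\, e^{i\omega k} (e^{-i\omega} - 1)^{n+m} + \mathcal{O}(N^{-1}) \\
&= \sum_{k'} p(k'|c) \sum_{n \geq 0} \frac{z^n}{n!} \sum_{m=0}^{k'} \binom{k'}{m} \binom{n+m}{k} x^{n+m} y^m (-1)^{n+m-k} I(k \leq n+m) + \mathcal{O}(N^{-1})
\end{aligned}
\tag{15}
\]

in which $I(S)$ is the indicator function (i.e. $I(S) = 1$ if $S$ is true, otherwise $I(S) = 0$). The 
observed average degree (10) for unbiased sampling is, as expected, 

\[
\bar{k}(x,y,z) = x(z + y\bar{k})
\tag{16}
\]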
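
The general expression (10) for the observed average degree is easy to evaluate numerically. The following sketch does so in plain Python; the function name `observed_mean_degree` and the dictionary formats for $p$ and $W$ are assumptions made here for illustration. For constant $x$, $y$, $z$ it should reproduce (16).

```python
def observed_mean_degree(p, W, x, y, z):
    """Evaluate (10). p: dict k -> p(k); W: dict (k, k') -> W(k, k').

    x(k), y(k, k'), z(k, k') are callables giving the sampling characteristics.
    """
    kbar = sum(k * pk for k, pk in p.items())
    denom = sum(pk * x(k) for k, pk in p.items())
    # contribution of falsely reported links
    false_pos = sum(x(q) * x(qp) * p[q] * p[qp] * z(q, qp) for q in p for qp in p)
    # contribution of true links that are detected
    true_pos = kbar * sum(x(q) * x(qp) * w * y(q, qp) for (q, qp), w in W.items())
    return (false_pos + true_pos) / denom
```

With degree-dependent callables, e.g. `x = lambda k: a * k / kmax`, the same function evaluates the biased cases treated in Section 3.3.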

Formula (15) simplifies further for various special cases. For instance: 

• Unbiased bond and/or node undersampling, i.e. $z = 0$: 

\[
p(k|x,y,0) = \sum_{k' \geq k} p(k') \binom{k'}{k} (xy)^k (1 - xy)^{k'-k}
\tag{17}
\]



How sampling affects macroscopic features of complex networks 






This implies that if we sample from a graph with Poissonian degree distribution, i.e. 
$p(k) = \bar{k}^k e^{-\bar{k}}/k!$, then the degree distribution of the sampled graph will be 

\[
p(k|x,y,0) = \frac{(xy\bar{k})^k e^{-xy\bar{k}}}{k!}
\tag{18}
\]

i.e. again a Poissonian distribution, but with a reduced average degree $\bar{k}(x, y, 0) = xy\bar{k}$. This 
recovers earlier results of [3, 4]. We note also that (17) is invariant under exchanging $x$ and 
$y$, so sampling all nodes and a given fraction of the bonds is equivalent to sampling all bonds 
and the same fraction of the nodes. We show in Section 4.1 that this equivalence between bonds 
and nodes under unbiased undersampling also holds for the degree correlations. In Figure 1 
we show the predicted degree distributions (17) together with the corresponding results of 
numerical simulation of the sampling process, for synthetically generated networks with size 
$N = 3512$ and average connectivity $\bar{k} = 3.72$ (as in the biological protein interaction network 
of C. elegans [12]) and Poissonian and power-law degree distributions. The agreement 
between theory and simulation is perfect. 

• Unbiased bond oversampling, i.e. $x = y = 1$: 

\[
\begin{aligned}
p(k|1,1,z) &= \sum_{k'} p(k') \sum_{n \geq 0} \frac{z^n}{n!} \sum_{m=0}^{k'} \binom{k'}{m} \binom{n+m}{k} (-1)^{n+m-k} I(k \leq n+m) \\
&= \sum_{k' \leq k} p(k') \frac{e^{-z} z^{k-k'}}{(k-k')!} = \sum_{\ell=0}^{k} p(k-\ell)\, e^{-z} z^\ell/\ell!
\end{aligned}
\tag{19}
\]

As with unbiased undersampling we observe that sampling from a graph with Poissonian 
degree distribution, i.e. $p(k) = \bar{k}^k e^{-\bar{k}}/k!$, leads to a sampled graph that is again Poissonian, 
but now with average degree $\bar{k}(1,1,z) = z + \bar{k}$: 

\[
p(k|1,1,z) = \sum_{q=0}^{k} \frac{e^{-\bar{k}}\, \bar{k}^{k-q}}{(k-q)!} \frac{e^{-z} z^q}{q!} = \frac{e^{-(\bar{k}+z)}}{k!} \sum_{q=0}^{k} \binom{k}{q} \bar{k}^{k-q} z^q = \frac{e^{-(\bar{k}+z)} (\bar{k}+z)^k}{k!}
\tag{20}
\]

Results from numerical simulations applied to Poissonian and preferential attachment 
networks are shown in Figure 2 together with the corresponding theoretical predictions. 
Again the agreement between theory and simulation is perfect. 
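
The two special cases above can be checked numerically. The sketch below implements the binomial thinning (17) and the Poisson convolution (19) in plain Python on a truncated degree range; the function names and the truncation parameter `kmax` are assumptions made here for illustration. Feeding in a Poissonian $p(k)$ should return a Poissonian with mean $xy\bar{k}$ or $\bar{k} + z$ respectively, up to truncation error.

```python
from math import comb, exp, factorial


def undersampled_distribution(p, x, y, kmax):
    """Equation (17): binomial thinning with retention probability x*y.

    p is a list with p[k] for k = 0..kmax; degrees above kmax are ignored.
    """
    r = x * y
    return [sum(p[q] * comb(q, k) * r**k * (1 - r)**(q - k) for q in range(k, kmax + 1))
            for k in range(kmax + 1)]


def oversampled_distribution(p, z, kmax):
    """Equation (19): convolution of p with a Poisson law of mean z."""
    return [sum(p[k - l] * exp(-z) * z**l / factorial(l) for l in range(k + 1))
            for k in range(kmax + 1)]


def poisson(mean, kmax):
    """Poissonian degree distribution on k = 0..kmax."""
    return [exp(-mean) * mean**k / factorial(k) for k in range(kmax + 1)]
```

For example, `undersampled_distribution(poisson(3.72, 60), 0.7, 1.0, 60)` can be compared entry by entry with `poisson(0.7 * 3.72, 60)`.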

Figure 1. Effect of unbiased node undersampling (top panels) and bond undersampling (bottom 
panels) on the degree distribution of synthetically generated networks with size $N = 3512$, average 
connectivity $\bar{k} = 3.72$ and Poissonian degree distribution (left panels) or power-law distributed 
degrees (right panels). Different symbols correspond to different fractions of sampled nodes 
(0.5, 0.7 and 1 as shown in the legend) and predicted values (symbols connected by dotted lines) 
lie on top of data points from simulations (symbols connected by dashed lines), obtained by 
averaging over 50 samples. 

3.3. Degree distribution for biased sampling 

In the case of biased sampling, where $x(k)$, $y(k,k')$ and $z(k,k')$ are no longer all degree-independent, 
one can no longer evaluate (11) without knowledge of the degree-degree correlations 
in the source graph $c$. However, the average (11) over the graph ensemble (4) with controlled 
degree correlations is still feasible. In Appendix A we calculate the marginal (A.24) of the 
expected kernel $W(k, k'|x,y,z)$ for the sampled network, from which we obtain $p(k|x,y,z)$ via 
the connection (13). At $k = 0$ both sides of (13) vanish, so $p(0|x,y,z)$ follows from normalisation, 
whereas for $k > 0$: 

\[
p(k|x,y,z) = \frac{1}{k}\, \frac{\sum_q x(q) p(q) \big\{ a(q) \mathcal{J}(k|q) + q\, b(q) \mathcal{L}(k|q) \big\}}{\sum_q p(q) x(q)}
\tag{21}
\]

with 

\[
\mathcal{J}(k|q) = e^{-a(q)} \sum_{n=0}^{\min\{k-1,q\}} \binom{q}{n} \frac{a(q)^{k-1-n}}{(k-1-n)!}\, b^n(q) (1 - b(q))^{q-n}
\tag{22}
\]

\[
\mathcal{L}(k|q) = e^{-a(q)} \sum_{n=0}^{\min\{k-1,q-1\}} \binom{q-1}{n} \frac{a(q)^{k-1-n}}{(k-1-n)!}\, b^n(q) (1 - b(q))^{q-1-n}
\tag{23}
\]

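\[
a(q) = \sum_{q' \geq 0} p(q') x(q') z(q, q'), \qquad b(q) = \frac{\bar{k}}{q\, p(q)} \sum_{q' > 0} x(q') y(q, q') W(q, q')
\tag{24}
\]

The average connectivity $\bar{k}(x,y,z)$, as given in (10), is easily obtained from (21) using 
normalization of the conditional probabilities $\mathcal{J}(k|q)$ and $\mathcal{L}(k|q)$: 

\[
\bar{k}(x, y, z) = \sum_k k\, p(k|x, y, z) = \frac{\sum_q x(q) p(q) \{ a(q) + q\, b(q) \}}{\sum_q p(q) x(q)}
\tag{25}
\]

Let us now work out these results for the 'natural' types of sampling bias, where the likelihood 
of observing nodes or links is proportional to the degrees of the nodes involved, with $\alpha \in [0, 1]$: 

• Biased node undersampling, i.e. $x(k) = \alpha k/k_{\max}$, $y(k, k') = 1$, $z(k, k') = 0$: 

Here we have 

\[
a(q) = 0, \qquad q\, b(q) \mathcal{L}(k|q) = k \binom{q}{k} b^k(q) (1 - b(q))^{q-k} I(q \geq k)
\tag{26}
\]

\[
\sum_q p(q) x(q) = \frac{\alpha \bar{k}}{k_{\max}}, \qquad b(q) = \frac{\alpha \bar{k}}{k_{\max}\, q\, p(q)} \sum_{q' > 0} q' W(q, q')
\tag{27}
\]

This leads to 

\[
p(k|\alpha,1,0) = \frac{1}{\bar{k}} \sum_{q \geq k} q\, p(q) \binom{q}{k} b^k(q) (1 - b(q))^{q-k}, \qquad b(q) \text{ as given in (27)}
\tag{28}
\]

and 

\[
\bar{k}(\alpha, 1, 0) = \frac{\alpha}{k_{\max}} \sum_{q,q' > 0} q q' W(q, q') = \frac{\alpha\, k^{(3)}}{k_{\max} \bar{k}}
\tag{29}
\]

where $k^{(3)} = N^{-1} \sum_{ijk\ell} c_{ij} c_{jk} c_{k\ell}$ is the average number of paths of length 3. 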

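• Biased bond undersampling, i.e. $x(k) = 1$, $y(k,k') = \alpha kk'/k_{\max}^2$, $z(k,k') = 0$: 
This choice leads again to (26), while equations (27) are now replaced by 

\[
\sum_q p(q) x(q) = 1, \qquad b(q) = \frac{\alpha \bar{k}}{p(q) k_{\max}^2} \sum_{q' > 0} q' W(q, q')
\tag{30}
\]

Hence, one gets 

\[
p(k|1,\alpha,0) = \sum_{q \geq k} p(q) \binom{q}{k} \Big( \frac{\alpha \bar{k}}{p(q) k_{\max}^2} \sum_{q' > 0} q' W(q,q') \Big)^{k} \Big( 1 - \frac{\alpha \bar{k}}{p(q) k_{\max}^2} \sum_{q'' > 0} q'' W(q,q'') \Big)^{q-k}
\tag{31}
\]

and 

\[
\bar{k}(1, \alpha, 0) = \frac{\alpha \bar{k}}{k_{\max}^2} \sum_{q,q' > 0} q q' W(q, q') = \frac{\alpha\, k^{(3)}}{k_{\max}^2}
\tag{32}
\]

• Biased bond oversampling, i.e. $x(k) = y(k, k') = 1$, $z(k, k') = \alpha kk'/k_{\max}^2$: 
Here we have 

\[
a(q) = \sum_{q' \geq 0} p(q') z(q,q') = \frac{\alpha \bar{k}}{k_{\max}^2}\, q, \qquad b(q) = 1, \qquad \sum_q p(q) x(q) = 1
\tag{33}
\]

\[
\mathcal{L}(k|q) = e^{-a(q)} \frac{a(q)^{k-q}}{(k-q)!}\, I(q \leq k)
\tag{34}
\]

\[
\mathcal{J}(k|q)\, a(q) = e^{-a(q)} \frac{a(q)^{k-q}}{(k-1-q)!}\, I(q \leq k-1) = (k - q) \mathcal{L}(k|q)
\tag{35}
\]

Substituting into (21) and (25) yields 

\[
p(k|1,1,\alpha) = \sum_{q \leq k} p(q) \mathcal{L}(k|q) = \sum_{q \leq k} p(q)\, e^{-\alpha \bar{k} q/k_{\max}^2}\, \frac{(\alpha \bar{k} q/k_{\max}^2)^{k-q}}{(k-q)!}
\tag{36}
\]

and 

\[
\bar{k}(1,1,\alpha) = \bar{k} + \frac{\alpha \bar{k}^2}{k_{\max}^2}
\tag{37}
\]

In Figure 3 we show the predicted degree distribution (36) together with the corresponding 
results from numerical simulations of the biased bond oversampling process. 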
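
The general result (21) with the definitions (22)-(24) can be evaluated directly once $p(q)$ and $W(q,q')$ are given. The following is a minimal sketch in plain Python; the function name `sampled_degree_distribution`, the helper `mix`, and the dictionary formats are assumptions made here for illustration. The helper uses the reading of $\mathcal{J}(k|q)$ and $\mathcal{L}(k|q)$ as the probability that one plus a binomial number of retained true links plus a Poissonian number of false links equals $k$. Only $k > 0$ is returned, in line with (21).

```python
from math import comb, exp, factorial


def sampled_degree_distribution(p, W, x, y, z, kmax_out):
    """Evaluate (21)-(24) for k = 1..kmax_out.

    p: dict q -> p(q); W: dict (q, q') -> W(q, q'); x, y, z: callables.
    """
    kbar = sum(q * pq for q, pq in p.items())
    norm = sum(pq * x(q) for q, pq in p.items())

    def a(q):  # first part of (24)
        return sum(p[qp] * x(qp) * z(q, qp) for qp in p)

    def b(q):  # second part of (24), only defined for q > 0 and p(q) > 0
        s = sum(x(qp) * y(q, qp) * W.get((q, qp), 0.0) for qp in p if qp > 0)
        return kbar * s / (q * p[q])

    def mix(k, m, aq, bq):
        # (22) for m = q and (23) for m = q - 1
        return exp(-aq) * sum(comb(m, n) * bq**n * (1 - bq)**(m - n)
                              * aq**(k - 1 - n) / factorial(k - 1 - n)
                              for n in range(min(k - 1, m) + 1))

    out = {}
    for k in range(1, kmax_out + 1):
        total = 0.0
        for q, pq in p.items():
            aq = a(q)
            bq = b(q) if q > 0 and pq > 0 else 0.0
            term = aq * mix(k, q, aq, bq)
            if q > 0:
                term += q * bq * mix(k, q - 1, aq, bq)
            total += x(q) * pq * term
        out[k] = total / (k * norm)
    return out
```

For constant $x$, $y$, $z$ the output should coincide with the unbiased formulae (17) and (19), which gives a simple check before moving to degree-dependent sampling characteristics.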



3.4. Summary 

We have seen that the degree distributions of large sampled networks can be calculated and 
written explicitly in terms of the topological characteristics of the true network, for unbiased 
and biased under- and oversampling. From the resulting equations we can draw the following 
conclusions: 

• Sampling generally affects the shape of the degree distribution of a network, with the 
exception of unbiased sampling from a Poissonian distribution (as for Erdős-Rényi graphs), 
where the sampled network is again Poissonian, with a modified average degree. 

• The degree distribution observed after unbiased node undersampling of a network is identical 
to that following unbiased bond undersampling, for any large graph, if the two (node- or 
bond-) sampling probabilities are identical. 

• In contrast, biased node undersampling (where the probability of observing a node is 
proportional to its degree) generally leads to a network with a degree distribution that 
is different from the one that would result from biased bond undersampling (where the 
probability of observing a bond is proportional to the degrees of the two attached nodes). 



4. Effects of sampling on degree correlation function 



In Appendix A we calculate the degree correlation function $W(k, k'|x, y, z)$ of large networks that 
are sampled according to the general protocol (2), from graphs generated from (4). The result, 
expressed in terms of the topological properties $p(k)$ and $W(k, k')$ of the true network, is 

\[
W(k,k'|x,y,z) = \frac{\sum_{q,q' \geq 0} x(q) x(q') \big\{ p(q) p(q') z(q,q') \mathcal{J}(k|q) \mathcal{J}(k'|q') + \bar{k}\, W(q,q') y(q,q') \mathcal{L}(k|q) \mathcal{L}(k'|q') \big\}}{\bar{k}(x,y,z) \sum_q p(q) x(q)}
\tag{38}
\]

with $\bar{k}(x, y, z)$ as given in (10), the two conditional distributions $\mathcal{J}(k|q)$ and $\mathcal{L}(k|q)$ defined in (22, 23), 
and with the short-hands $a(q)$ and $b(q)$ defined in (24). We will now work out this general result 
for the most common types of sampling, viz. node undersampling, bond undersampling, and 
bond oversampling, including both unbiased and biased protocols. 

Figure 2. Effect of unbiased bond oversampling on the degree distribution of synthetic Poissonian 
graphs (left panel) and synthetic power law graphs (right panel), both with size $N = 3512$ and 
average connectivity $\bar{k} = 3.72$. Different symbols correspond to different fractions $z/N$ of 'false 
positive' bonds, with $z = 0, 2, 5, 10$ as shown in the legend. The theoretically predicted values 
(symbols connected by dotted lines) are found to lie perfectly on top of the data points from 
simulations (symbols connected by dashed lines), obtained by averaging over 100 samples. 

Figure 3. Effect of biased bond oversampling (i.e. $x(k) = 1$, $y(k, k') = 1$, $z(k, k') = \alpha kk'/k_{\max}^2$) 
on the degree distribution of synthetic Poissonian graphs (left panel) and synthetic power law 
graphs (right panel), both with size $N = 3512$ and average connectivity $\bar{k} = 3.72$. Different 
symbols correspond to different values of $z = \alpha \bar{k}^2/k_{\max}^2 = 0, 2, 5, 10$, as shown in the legend. 
Theoretically predicted values (symbols connected by dotted lines) are found to lie perfectly 
on top of the data points from simulations (symbols connected by dashed lines), obtained by 
averaging over 100 samples. 






4.1. Degree correlations for unbiased sampling 

For unbiased sampling protocols where $x(q) = x$, $y(q, q') = y$ and $z(q, q') = z$, one has $a(q) = xz$, 
$b(q) = xy$ and $\mathcal{L}(k|q) = \mathcal{J}(k|q-1)$, so (38) simplifies immediately to 

\[
W(k,k'|x,y,z) = \frac{\sum_{q,q' \geq 0} \big\{ z\, p(q) p(q') \mathcal{J}(k|q) \mathcal{J}(k'|q') + y \bar{k}\, W(q,q') \mathcal{J}(k|q-1) \mathcal{J}(k'|q'-1) \big\}}{z + y \bar{k}}
\tag{39}
\]

with 

\[
\mathcal{J}(k|q) = e^{-xz} \sum_{n=0}^{\min\{k-1,q\}} \binom{q}{n} \frac{(xz)^{k-1-n}}{(k-1-n)!}\, (xy)^n (1 - xy)^{q-n}
\tag{40}
\]

Formula (39) simplifies further for various special cases: 

• Unbiased node and/or bond undersampling, i.e. $z = 0$. 
Here we obtain 

\[
\mathcal{J}(k|q) = \binom{q}{k-1} (xy)^{k-1} (1 - xy)^{q-k+1}\, I(q \geq k-1)
\tag{41}
\]

so equation (39) reduces to 

\[
W(k,k'|x,y,0) = \sum_{q \geq k} \sum_{q' \geq k'} W(q,q') \binom{q-1}{k-1} \binom{q'-1}{k'-1} (xy)^{k+k'-2} (1 - xy)^{q+q'-k-k'}
\tag{42}
\]

We note that $W(k,k'|x,y,0)$, like (17) previously, is symmetric under exchanging $x$ and $y$, i.e. 
node and bond unbiased undersampling lead to the same degree correlations. Therefore the 
equivalence between the two samplings is now fully established for large graphs drawn from 
ensemble (4). 

Equation (42) clearly shows that sampling from graphs in which degree correlations are 
present will generally affect those correlations, even in Poissonian networks, in spite of 
the fact that there the degree distribution is only changed via a reduction of the average 
degree. Conversely, if we sample from graphs without degree correlations, i.e. for 
which $W(k,k') = W(k)W(k') = p(k)p(k')kk'/\bar{k}^2$, equation (42) reveals that the degree 
correlation function in the sampled graph factorizes in the product of its marginals as well, 
i.e. $W(k,k'|x,y,0) = W(k|x,y,0)W(k'|x,y,0)$. This means that unbiased bond and/or 
node undersampling from graphs without degree correlations does not generate any degree 
correlations. 

In order to observe how sampling protocols affect degree correlations we will monitor, 
instead of $W(k,k')$ itself, the normalised kernel $\Pi(k,k') = W(k, k')/W(k)W(k')$ which 
will by definition equal unity in the absence of degree correlations. Any deviation from 
$\Pi(k, k') = 1$ will thus signal the presence of degree correlations. We show the predicted degree 
correlations in the case of unbiased bond undersampling, together with the corresponding 
results of numerical simulations, for Poissonian and power law graphs, in figures 4 and 5 
respectively. In figure 6 we show numerical results and theoretical predictions for unbiased 
node undersampling from Poissonian and power law graphs. The agreement between theory 
and simulation is very satisfactory; deviations are small and consistent with finite size effects. 

Figure 4. Normalised degree correlation function $\Pi(k, k') = W(k, k')/W(k)W(k')$ of 
synthetically generated Poissonian graphs with $N = 3512$ and $\bar{k} = 3.72$ before (top panels) and 
after (middle panels) sampling a fraction $y = 0.7$ of the bonds of the original graphs (data result 
from averaging over 10^ samples) and their respective theoretical predictions (bottom panels). 



• Unbiased bond oversampling, i.e. $x = y = 1$. 
Here we have 

\[
\mathcal{J}(k|q) = e^{-z} \frac{z^{k-1-q}}{(k-1-q)!}\, I(k \geq q+1)
\tag{43}
\]




Figure 5. Normalised degree correlation function $\Pi(k, k')$ of synthetically generated power law 
graphs with $N = 3512$ and $\bar{k} = 3.72$ before (top panels) and after (middle panels) sampling a 
fraction $y = 0.9$ of the bonds of the original graph (data result from averaging over 10^ samples) 
and their theoretical prediction (bottom panels). 



so using our earlier result from equation (19), 

\[
p(k|1,1,z) = e^{-z} \sum_{q=0}^{k} p(q) \frac{z^{k-q}}{(k-q)!}
\tag{44}
\]

we may write 

\[
\sum_{q \geq 0} p(q) \mathcal{J}(k|q) = p(k-1|1,1,z)
\tag{45}
\]

which leads to the transparent expression 

\[
W(k,k'|1,1,z) = \frac{1}{\bar{k} + z} \Big\{ z\, p(k-1|1,1,z)\, p(k'-1|1,1,z) + \bar{k}\, e^{-2z} \sum_{q=1}^{k} \sum_{q'=1}^{k'} W(q,q') \frac{z^{k-q+k'-q'}}{(k-q)!\,(k'-q')!} \Big\}
\tag{46}
\]

Figure 6. Normalised degree correlation function $\Pi(k, k')$ of synthetically generated Poissonian 
(left) and power law (right) graphs with $N = 3512$ and $\bar{k} = 3.72$ before (top panels) and after 
(middle panels) sampling a fraction $x = 0.9$ of the nodes of the original graph (data result from 
averaging over 10^ samples) and their theoretical prediction (bottom panels). 

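We note for later that substituting (45) into (21) and bearing in mind that $\mathcal{L}(k|q) = 
\mathcal{J}(k|q-1)$, $a(q) = z$ and $b(q) = 1$, we have 

\[
p(k|1,1,z) = \frac{1}{k} \Big[ z\, p(k-1|1,1,z) + \sum_{q \geq 1} p(q)\, q\, \mathcal{J}(k|q-1) \Big]
\tag{47}
\]

which yields 

\[
p(k-1|1,1,z) = \frac{k}{z}\, p(k|1,1,z) - \frac{\bar{k}}{z} \sum_{q \geq 1} W(q) \mathcal{J}(k|q-1)
\tag{48}
\]

where $W(k) = k p(k)/\bar{k}$. 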

We now study the effects of oversampling on graphs without degree correlations. Denoting 

\[
S_z(k) = \sum_{q \geq 1} W(q) \mathcal{J}(k|q-1)
\tag{49}
\]

which is $z$-dependent via the function $\mathcal{J}$, we may rewrite (46) as 

\[
\begin{aligned}
W(k,k'|1,1,z) = {} & \frac{z}{\bar{k} + z} \Big[ \frac{k}{z}\, p(k|1,1,z) - \frac{\bar{k}}{z} S_z(k) \Big] \Big[ \frac{k'}{z}\, p(k'|1,1,z) - \frac{\bar{k}}{z} S_z(k') \Big] \\
& + \frac{\bar{k}}{\bar{k} + z} \sum_{q,q' \geq 1} W(q,q') \mathcal{J}(k|q-1) \mathcal{J}(k'|q'-1)
\end{aligned}
\tag{50}
\]

If the original graph has no degree correlation, i.e. 

\[
W(q,q') = W(q) W(q') = p(q) p(q')\, q q'/\bar{k}^2
\tag{51}
\]

the sampled graph will have degree correlation 

\[
\begin{aligned}
W(k,k'|1,1,z) &= \frac{z}{\bar{k} + z} \Big[ \frac{k}{z}\, p(k|1,1,z) - \frac{\bar{k}}{z} S_z(k) \Big] \Big[ \frac{k'}{z}\, p(k'|1,1,z) - \frac{\bar{k}}{z} S_z(k') \Big] + \frac{\bar{k}}{\bar{k} + z} S_z(k) S_z(k') \\
&= \frac{\bar{k}}{z} \Big\{ \frac{\bar{k} + z}{\bar{k}} W(k|1,1,z) W(k'|1,1,z) + S_z(k) S_z(k') - W(k|1,1,z) S_z(k') - W(k'|1,1,z) S_z(k) \Big\} \\
&= \frac{\bar{k}}{z} \Big\{ \big( W(k|1,1,z) - S_z(k) \big) \big( W(k'|1,1,z) - S_z(k') \big) + \frac{z}{\bar{k}} W(k|1,1,z) W(k'|1,1,z) \Big\} \\
&= W(k|1,1,z) W(k'|1,1,z) + \frac{\bar{k}}{z} \big( W(k|1,1,z) - S_z(k) \big) \big( W(k'|1,1,z) - S_z(k') \big)
\end{aligned}
\tag{52}
\]

where we have used $W(k|1,1,z) = k p(k|1,1,z)/(\bar{k} + z)$, in accordance with (13) and 
(16). For $z = 0$, $\mathcal{J}(k|q) = \delta_{k,q+1}$ and $W(k|1,1,0) = S_0(k) = W(k)$ so the second term 
in (52) vanishes; however, for $z \neq 0$ this will be generally different from zero: crucially, 
but not unexpectedly, oversampling from a graph without degree correlations automatically 
introduces degree correlations. 

Figure 7. Normalised degree correlation function $\Pi(k, k')$ of synthetically generated Poissonian 
(left) and power law (right) graphs with $N = 3512$ and $\bar{k} = 3.72$ before (top panels) and after 
(middle panels) adding a fraction $z/N$ of bonds, with $z = 1$ (left) and $z = 2$ (right), and their 
respective theoretical predictions (bottom panels). Data obtained by averaging over 10^ samples. 
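
The unbiased kernel (39), with $\mathcal{J}(k|q)$ from (40), and the normalised kernel $\Pi(k,k')$ can be evaluated with a few lines of plain Python. The sketch below is illustrative only: the function names, the dictionary formats and the output cutoff `kmax_out` are assumptions made here, and the marginals used for $\Pi$ are computed from the truncated kernel, so `kmax_out` must be large enough to capture essentially all of the probability mass.

```python
from math import comb, exp, factorial


def J(k, q, x, y, z):
    """Equation (40); returns 0 for k < 1 or q < 0."""
    if k < 1 or q < 0:
        return 0.0
    return exp(-x * z) * sum(comb(q, n) * (x * y)**n * (1 - x * y)**(q - n)
                             * (x * z)**(k - 1 - n) / factorial(k - 1 - n)
                             for n in range(min(k - 1, q) + 1))


def sampled_kernel(p, W, x, y, z, kmax_out):
    """Equation (39) for k, k' = 1..kmax_out; p: dict q -> p(q), W: dict (q, q') -> W."""
    kbar = sum(q * pq for q, pq in p.items())
    ks = range(1, kmax_out + 1)
    qs = set(p) | {q - 1 for q in p}
    Jq = {(k, q): J(k, q, x, y, z) for k in ks for q in qs}
    pJ = {k: sum(p[q] * Jq[(k, q)] for q in p) for k in ks}
    out = {}
    for k in ks:
        for kp in ks:
            false_part = z * pJ[k] * pJ[kp]
            true_part = y * kbar * sum(w * Jq[(k, q - 1)] * Jq[(kp, qp - 1)]
                                       for (q, qp), w in W.items())
            out[(k, kp)] = (false_part + true_part) / (z + y * kbar)
    return out


def normalised_kernel(Wk):
    """Pi(k, k') = W(k, k') / (W(k) W(k')), using the marginals of Wk itself."""
    marg = {}
    for (k, kp), w in Wk.items():
        marg[k] = marg.get(k, 0.0) + w
    return {(k, kp): w / (marg[k] * marg[kp])
            for (k, kp), w in Wk.items() if marg[k] > 0 and marg[kp] > 0}
```

Applied to an uncorrelated input kernel, this should give $\Pi = 1$ for $z = 0$ and deviations from unity for $z > 0$, in line with the statements following (42) and (52).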



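4.2. Degree correlations for biased sampling 

Let us now work out (38) for the types of biased sampling considered above. 

• Biased node undersampling, i.e. $x(k) = \alpha k/k_{\max}$, $y(k,k') = 1$, $z(k,k') = 0$: 
Here we have 

\[
a(q) = 0, \qquad \mathcal{L}(k|q) = \binom{q-1}{k-1} b^{k-1}(q) (1 - b(q))^{q-k}\, I(q \geq k), \qquad b(q) = \frac{\alpha \bar{k}}{k_{\max}\, q\, p(q)} \sum_{q'' > 0} q'' W(q, q'')
\]

so our equations reduce to 

\[
W(k,k'|\alpha,1,0) = \frac{\sum_{q \geq k} \sum_{q' \geq k'} q q'\, W(q,q') \binom{q-1}{k-1} \binom{q'-1}{k'-1} b^{k-1}(q) (1 - b(q))^{q-k}\, b^{k'-1}(q') (1 - b(q'))^{q'-k'}}{\sum_{q,q' > 0} q q'\, W(q,q')}
\]

• Biased bond undersampling, i.e. $x(k) = 1$, $y(k,k') = \alpha kk'/k_{\max}^2$, $z(k,k') = 0$: 
For this choice we obtain 

\[
a(q) = 0, \qquad \mathcal{L}(k|q) = \binom{q-1}{k-1} b^{k-1}(q) (1 - b(q))^{q-k}\, I(q \geq k), \qquad b(q) = \frac{\alpha \bar{k}}{p(q) k_{\max}^2} \sum_{q'' > 0} q'' W(q, q'')
\]

which leads to 

\[
W(k,k'|1,\alpha,0) = \frac{\sum_{q \geq k} \sum_{q' \geq k'} q q'\, W(q,q') \binom{q-1}{k-1} \binom{q'-1}{k'-1} b^{k-1}(q) (1 - b(q))^{q-k}\, b^{k'-1}(q') (1 - b(q'))^{q'-k'}}{\sum_{q,q' > 0} q q'\, W(q,q')}
\]

• Biased bond oversampling, i.e. $x(k) = 1$, $y(k,k') = 1$, $z(k,k') = \alpha kk'/k_{\max}^2$: 
Here we get 

\[
a(q) = \frac{\alpha \bar{k} q}{k_{\max}^2}, \qquad b(q) = 1, \qquad \mathcal{J}(k|q) = e^{-a(q)} \frac{a(q)^{k-1-q}}{(k-1-q)!}\, I(q \leq k-1), \qquad \mathcal{L}(k|q) = e^{-a(q)} \frac{a(q)^{k-q}}{(k-q)!}\, I(q \leq k)
\]

Hence we obtain 

\[
\begin{aligned}
W(k,k'|1,1,\alpha) = \frac{1}{\bar{k} + \alpha \bar{k}^2/k_{\max}^2} \Big\{ & \bar{k} \sum_{q \leq k} \sum_{q' \leq k'} W(q,q')\, e^{-a(q) - a(q')}\, \frac{a(q)^{k-q}\, a(q')^{k'-q'}}{(k-q)!\,(k'-q')!} \\
& + \frac{\alpha}{k_{\max}^2} \sum_{q \leq k-1} \sum_{q' \leq k'-1} p(q) p(q')\, q q'\, e^{-a(q) - a(q')}\, \frac{a(q)^{k-1-q}\, a(q')^{k'-1-q'}}{(k-1-q)!\,(k'-1-q')!} \Big\}
\end{aligned}
\tag{62}
\]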



4.3. Summary 

As was the case for the degree distribution, the degree correlations can also be calculated exactly, 
and in terms of fully explicit relations, for a broad class of sampling protocols. In contrast 
to the degree distribution, for which the sampling problem had already been studied partly by 
other authors, we are not aware of any earlier analytical results for degree correlations. Our equations 
revealed that: 

• Sampling will generally affect the degree correlations of networks, even in the unbiased case, if 
the original networks had such degree correlations. 

• Uncorrelated networks will remain uncorrelated after sampling only for unbiased node and/or 
bond undersampling. Bond oversampling will in general introduce degree correlations, even 
in the unbiased case. 

• Unbiased node and bond undersampling both modify the degree correlations (and the degree 
distribution) in the same way, so they are equivalent for any graph with prescribed topological 
features $p(k)$ and $W(k, k')$, as generated from (4). 

• Node and bond undersampling cannot be mapped onto each other in the case of biased 
sampling; their effects are qualitatively different. 



5. Discussion 



It is well known that the presently available data on cellular signalling networks are incomplete, 
and often suffer from serious experimental bias, reflecting the highly nontrivial nature of the 
experimental methods available for their collection. Yet a significant number of research papers 
continue to be written in which such data are used to infer statements on the possible biological 
relevance of local network modules or motifs. In addition the signalling network data are 
increasingly used for preprocessing gene expression data in order to derive more robust disease 
specific prognostic signatures [13, 14, 15], and will very soon impact on actual treatment decisions 
in medicine (e.g. will be used to suggest which cancer patients are likely to benefit from which 
chemotherapy). Given this situation, it is vital that we understand quantitatively the data 
imperfections, i.e. the relation between the true biological signalling networks probed and the 
imperfect network samples of these networks that are reported in public data repositories and 
presently used by biomedical scientists. To do this we need mathematical tools; the relevant 
networks are too large to rely on numerical simulation alone. Moreover, unlike simulations, 
analytical results can be used in reverse to infer the most probable true networks from the 
imperfect observed samples. 

Figure 8. Effects of biased bond oversampling on the normalised degree correlation $\Pi(k, k')$ 
of the synthetically generated Poissonian and power law graphs ($N = 3512$, $\bar{k} = 3.72$) shown 
in the top left and right panels, respectively. Middle panels show the result of simulations for 
$\alpha \bar{k}^2/k_{\max}^2 = 1$ (left) and $\alpha \bar{k}^2/k_{\max}^2 = 0.7$ (right) and bottom panels show the corresponding 
theoretical predictions. Numerical data result from averaging over 10^ samples. 

Ensembles of tailored random graphs with controlled topological properties are a natural 
and rigorous language for describing biological networks. They suggest precise definitions of 
structural features, they allow us to classify networks and obtain precise (dis)similarity measures, 
they provide precise 'null models' for hypothesis testing, and they can serve as efficient proxies 



How sampling affects macroscopic features of complex networks 



20 



for real networks in process modelling. In this paper we have shown how they can also be used to 
study analytically the effects of sampling on macroscopic topological properties of large biological 
networks, under a much wider range of conditions than those considered in previous analytical 
studies (the latter are recovered as special simple cases). We have obtained explicit expressions 
for both degree distributions and degree correlation kernels of sampled networks, and have been 
able to do this for sampling protocols that involve node and/or link undersampling as well as for 
link oversampling. Our predictions are in excellent agreement with numerical simulations. 

As could have been expected, the most dangerous types of sampling are the biased ones, 
where the probability of observing bonds or links depends on the degrees of the nodes concerned. 
Unfortunately, present experimental protocols are quite likely to involve precisely such sampling. 
We therefore hope that our new analytical tools, which take the form of explicit and transparent 
equations that connect the topological structure functions $p(k)$ and $W(k, k')$ of the sampled and 
the true networks, can prove useful in explaining and decontaminating signalling network data. 

Appendix A. Joint degree distribution of connected nodes 

Appendix A.1. Path integral representation of $W(k, k'|x, y, z)$ 

Here we calculate the joint degree distribution of connected nodes (12) that will be observed in 
large networks that are sampled, according to protocol (2), from typical graphs with prescribed 
macroscopic topological features $p(k)$ and $W(k, k')$, as generated from (1). With the short-hands 
$\tilde{W}(k, k'|x, y, z) = W(k, k'|x, y, z)\,\bar{k}(x, y, z)\sum_q p(q)x(q)$, $\mathbf{k} = (k_1, \ldots, k_N)$, 
and $\boldsymbol{\Omega} = (\Omega_1, \ldots, \Omega_N) \in [-\pi, \pi]^N$ we may write 



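\[
\begin{aligned}
\tilde{W}(k,k'|x,y,z) &= \lim_{N\to\infty} \sum_{\mathbf{c}} p(\mathbf{c}) \Big\langle \frac{1}{N}\sum_{i\neq j} c'_{ij}\, \delta_{k,\sum_\ell c'_{i\ell}}\, \delta_{k',\sum_\ell c'_{j\ell}} \Big\rangle_{\sigma,\tau,\lambda},
\qquad c'_{ij} = \sigma_i\sigma_j\big[\tau_{ij}c_{ij} + \lambda_{ij}(1-c_{ij})\big] \\
&= \lim_{N\to\infty} \frac{1}{N}\sum_{i\neq j} \int \frac{d\omega\, d\omega'}{4\pi^2}\, e^{i(\omega k + \omega' k')} \sum_{\mathbf{c}} p(\mathbf{c}) \Big\langle c'_{ij}\, e^{-i\omega\sum_\ell c'_{i\ell} - i\omega'\sum_\ell c'_{j\ell}} \Big\rangle_{\sigma,\tau,\lambda}
\end{aligned}
\tag{A.1}
\]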



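We next introduce the following order parameters 

\[ P(q,\Omega|\mathbf{k},\boldsymbol{\Omega}) = \frac{1}{N}\sum_r \delta_{q,k_r}\,\delta(\Omega-\Omega_r) \tag{A.2} \]

and insert into (A.1) for each $(q, \Omega)$ the following integral: 

\[ 1 = \int dP(q,\Omega)\, \delta\big[P(q,\Omega) - P(q,\Omega|\mathbf{k},\boldsymbol{\Omega})\big]
= \frac{N}{2\pi}\int dP(q,\Omega)\, d\hat{P}(q,\Omega)\; e^{iN\hat{P}(q,\Omega)P(q,\Omega) - i\hat{P}(q,\Omega)\sum_r \delta_{q,k_r}\delta(\Omega-\Omega_r)} \tag{A.3} \]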

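This converts (A.1) into the following path integral, with the short-hand 
$\{dP\,d\hat{P}\} = \prod_{q,\Omega}[dP(q,\Omega)\,d\hat{P}(q,\Omega)/2\pi]$ and with $Z'_N$ a new constant that, 
apart from containing the original normalisation constant, absorbs various factors and constants 
that are generated when transforming sums over $\Omega$ into integrals: 

\[
\begin{aligned}
\tilde{W}(k,k'|x,y,z) = \lim_{N\to\infty} &\int \frac{\{dP\,d\hat{P}\}}{Z'_N}\, e^{N\Psi[P,\hat{P}] + \mathcal{O}(N^0)} \sum_{q,q'}\int d\Omega\, d\Omega'\, P(q,\Omega)P(q',\Omega')\,x(q)x(q') \\
&\times E(q,\Omega;q',\Omega') \Big(\int \frac{d\omega}{2\pi}\, e^{i\omega(k-1) + (e^{-i\omega}-1)Q(q,\Omega)}\Big) \Big(\int \frac{d\omega'}{2\pi}\, e^{i\omega'(k'-1) + (e^{-i\omega'}-1)Q(q',\Omega')}\Big)
\end{aligned}
\tag{A.4}
\]

in which $\Psi[P, \hat{P}]$ will eventually drop out of our formulae (via normalization) and 

\[
\begin{aligned}
\Psi[P,\hat{P}] = {}& i\sum_q \int d\Omega\, \hat{P}(q,\Omega)P(q,\Omega) + \sum_k p(k)\log\int \frac{d\Omega}{2\pi}\, e^{i\Omega k - i\hat{P}(k,\Omega)} \\
&+ \frac{1}{2}\bar{k}\sum_{q,q'}\int d\Omega\, d\Omega'\, P(q,\Omega)P(q',\Omega')\,\frac{W(q,q')}{p(q)p(q')}\big(e^{-i(\Omega+\Omega')} - 1\big)
\end{aligned}
\tag{A.5}
\]

\[ E(q,\Omega;q',\Omega') = z(q,q') + y(q,q')\,\bar{k}\,\frac{W(q,q')}{p(q)p(q')}\, e^{-i(\Omega+\Omega')} \tag{A.6} \]

\[ Q(q,\Omega) = \sum_{q''}\int d\Omega''\, P(q'',\Omega'')\,x(q'')\,E(q,\Omega;q'',\Omega'') \tag{A.7} \]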

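The relevant $\omega$-integrals are of the familiar form 

\[ \int_{-\pi}^{\pi}\frac{d\omega}{2\pi}\, e^{i\omega\ell + Qe^{-i\omega}} = \sum_{n\geq 0}\frac{Q^n}{n!}\int_{-\pi}^{\pi}\frac{d\omega}{2\pi}\, e^{i\omega(\ell-n)} = \frac{Q^\ell}{\ell!} \tag{A.8} \]

(unless $\ell < 0$, in which case the integral is zero). We also note that by definition we always have 
the normalisation identity $\sum_{k,k'>0} W(k, k'|x, y, z) = 1$. So we arrive at: 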

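\[
W(k,k'|x,y,z) = (1-\delta_{k,0})(1-\delta_{k',0})\,
\frac{\sum_{q,q'} x(q)x(q') \int d\Omega\, d\Omega'\, P(q,\Omega)P(q',\Omega')\,E(q,\Omega;q',\Omega')\,I(k|q,\Omega)\,I(k'|q',\Omega')}
{\sum_{q,q'} x(q)x(q') \int d\Omega\, d\Omega'\, P(q,\Omega)P(q',\Omega')\,E(q,\Omega;q',\Omega')}
\tag{A.9}
\]

with 

\[ I(k|q,\Omega) = e^{-Q(q,\Omega)}\, Q^{k-1}(q,\Omega)/(k-1)! \tag{A.10} \]

and in which, via the steepest descent argument, the order parameters $\{P, \hat{P}\}$ are the functions 
that extremise the kernel (A.5). 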
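
The integral identity (A.8), which turns each $\omega$-integral in (A.4) into the Poisson-like weight (A.10), can be checked numerically. The following is only a minimal sketch: it assumes the identity in the form $\int_{-\pi}^{\pi}\frac{d\omega}{2\pi} e^{i\omega\ell + Qe^{-i\omega}} = Q^\ell/\ell!$ (zero for $\ell < 0$), and the function names, the grid size and the sample value of $Q$ are illustrative choices that do not appear in the text.

```python
import numpy as np
from math import factorial

def omega_integral(ell, Q, n_grid=4096):
    """Approximate (1/2pi) * int_{-pi}^{pi} dw exp(i*w*ell + Q*exp(-i*w)).

    The integrand is 2pi-periodic, so the mean over a uniform grid
    converges very quickly. Q may be complex.
    """
    w = -np.pi + 2.0 * np.pi * np.arange(n_grid) / n_grid
    return np.mean(np.exp(1j * w * ell + Q * np.exp(-1j * w)))

def poisson_weight(k, Q):
    """I(k|q,Omega) of (A.10) for a given value Q = Q(q,Omega), with k > 0."""
    return np.exp(-Q) * Q ** (k - 1) / factorial(k - 1)

if __name__ == "__main__":
    Q = 1.3 - 0.4j  # hypothetical value of Q(q, Omega)
    for ell in (-2, 0, 1, 5):
        exact = Q ** ell / factorial(ell) if ell >= 0 else 0.0
        print(ell, abs(omega_integral(ell, Q) - exact))
```

Multiplying `omega_integral(k - 1, Q)` by `exp(-Q)` reproduces `poisson_weight(k, Q)`, which is the step that leads from (A.4) to (A.10).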

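Appendix A.2. Functional saddle-point equations 

Functional variation of (A.5) gives the following saddle-point equations for $\{P, \hat{P}\}$: 

\[ i\hat{P}(q,\Omega) = -\bar{k}\sum_{q'}\int d\Omega'\, P(q',\Omega')\,\frac{W(q,q')}{p(q)p(q')}\big(e^{-i(\Omega+\Omega')} - 1\big) \tag{A.11} \]

\[ P(q,\Omega) = p(q)\,\frac{e^{i\Omega q - i\hat{P}(q,\Omega)}}{\int d\Omega'\, e^{i\Omega' q - i\hat{P}(q,\Omega')}} \tag{A.12} \]

Equivalently: 

\[ i\hat{P}(q,\Omega) = \bar{k}\lambda(q) - \bar{k}e^{-i\Omega}\phi(q), \qquad
P(q,\Omega) = p(q)\,\frac{e^{i\Omega q + \bar{k}e^{-i\Omega}\phi(q)}}{\int d\Omega'\, e^{i\Omega' q + \bar{k}e^{-i\Omega'}\phi(q)}} \tag{A.13} \]

with 

\[ \phi(q) = \sum_{q'}\int d\Omega'\, P(q',\Omega')\, e^{-i\Omega'}\,\frac{W(q,q')}{p(q)p(q')} \tag{A.14} \]

\[ \lambda(q) = \sum_{q'}\int d\Omega'\, P(q',\Omega')\,\frac{W(q,q')}{p(q)p(q')} \tag{A.15} \]

The integrals over $\Omega$ in (A.14) and (A.15) are again of the type (A.8), from which we derive 
$\int d\Omega\, P(q,\Omega) = p(q)$ and $\int d\Omega\, P(q,\Omega)e^{-i\Omega} = p(q)\,q/\bar{k}\phi(q)$. This then converts (A.14, A.15) into 

\[ \phi(q) = \frac{1}{\bar{k}}\sum_{q'} \frac{q'\,W(q,q')}{p(q)p(q')\phi(q')}, \qquad \lambda(q) = \sum_{q'}\frac{W(q,q')}{p(q)} \tag{A.16} \]

Since we know the marginal of the distribution $W(k, k')$ to be $W(k) = kp(k)/\bar{k}$ (which 
follows directly from its definition), we can immediately read off the solution of (A.16): 

\[ \phi(q) = \lambda(q) = q/\bar{k} \tag{A.17} \]

Insertion into (A.13) and using (A.8) gives the solution of (A.11, A.12) in explicit form: 

\[ i\hat{P}(q,\Omega) = q - qe^{-i\Omega}, \qquad P(q,\Omega) = p(q)\,\frac{q!}{2\pi q^q}\, e^{i\Omega q + qe^{-i\Omega}} \tag{A.18} \]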
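
As a quick sanity check of the saddle-point solution, one can verify numerically that it reproduces the two $\Omega$-moments used in this subsection. The sketch below assumes the explicit form $P(q,\Omega) = p(q)\,\frac{q!}{2\pi q^q}\,e^{i\Omega q + qe^{-i\Omega}}$; with $\phi(q) = q/\bar{k}$ both $\int d\Omega\, P(q,\Omega)$ and $\int d\Omega\, P(q,\Omega)e^{-i\Omega}$ should then equal $p(q)$ for $q > 0$. The function names and the grid size are illustrative assumptions.

```python
import numpy as np
from math import factorial

def saddle_point_P(q, omega, p_q):
    """P(q, Omega) in the assumed explicit saddle-point form, for degree q."""
    norm = factorial(q) / (2.0 * np.pi * float(q) ** q)
    return p_q * norm * np.exp(1j * omega * q + q * np.exp(-1j * omega))

def omega_moments(q, p_q, n_grid=4096):
    """Return (int dOmega P, int dOmega P e^{-i Omega}) over [-pi, pi].

    Uses a uniform grid, which is very accurate for periodic integrands.
    """
    omega = -np.pi + 2.0 * np.pi * np.arange(n_grid) / n_grid
    P = saddle_point_P(q, omega, p_q)
    d_omega = 2.0 * np.pi / n_grid
    return np.sum(P) * d_omega, np.sum(P * np.exp(-1j * omega)) * d_omega

if __name__ == "__main__":
    for q in (1, 2, 5):
        m0, m1 = omega_moments(q, p_q=0.2)  # p_q = 0.2 is a made-up value of p(q)
        print(q, abs(m0 - 0.2), abs(m1 - 0.2))
```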
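Appendix A.3. Final result for the distribution $W(k, k'|x, y, z)$ 

We can now evaluate the various ingredients of (A.9). The function $Q(q, \Omega)$ becomes 

\[ Q(q,\Omega) = \sum_{q'\geq 0} p(q')x(q')z(q,q') + \frac{\bar{k}e^{-i\Omega}}{p(q)}\sum_{q'\geq 0} x(q')y(q,q')W(q,q') \tag{A.19} \]

Hence 

\[
\begin{aligned}
&\int d\Omega\, d\Omega'\, P(q,\Omega)P(q',\Omega')\,E(q,\Omega;q',\Omega')\,I(k|q,\Omega)\,I(k'|q',\Omega') \\
&\quad = \frac{p(q)\,q!}{q^q (k-1)!}\int\frac{d\Omega}{2\pi}\, e^{i\Omega q + qe^{-i\Omega} - Q(q,\Omega)}\, Q^{k-1}(q,\Omega)\;
\frac{p(q')\,q'!}{q'^{q'} (k'-1)!}\int\frac{d\Omega'}{2\pi}\, e^{i\Omega' q' + q'e^{-i\Omega'} - Q(q',\Omega')}\, Q^{k'-1}(q',\Omega')\; E(q,\Omega;q',\Omega') \\
&\quad = p(q)p(q')z(q,q')\,J(k|q)J(k'|q') + \bar{k}W(q,q')y(q,q')\,\mathcal{L}(k|q)\mathcal{L}(k'|q')
\end{aligned}
\tag{A.20}
\]

in which 

\[ J(k|q) = \frac{q!}{q^q (k-1)!}\int\frac{d\Omega}{2\pi}\, e^{i\Omega q + qe^{-i\Omega}}\, Q^{k-1}(q,\Omega)\, e^{-Q(q,\Omega)} \tag{A.21} \]

and 

\[ \mathcal{L}(k|q) = \frac{q!}{q^q (k-1)!}\int\frac{d\Omega}{2\pi}\, e^{i\Omega(q-1) + qe^{-i\Omega}}\, Q^{k-1}(q,\Omega)\, e^{-Q(q,\Omega)} \tag{A.22} \]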
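Summation over $k$ reveals that $\sum_{k>0} J(k|q) = \sum_{k>0} \mathcal{L}(k|q) = 1$ for all $q > 0$, which leads to the 
final result: 

\[
\begin{aligned}
W(k,k'|x,y,z) &= \frac{\sum_{q,q'\geq 0} x(q)x(q')\big[p(q)p(q')z(q,q')J(k|q)J(k'|q') + \bar{k}W(q,q')y(q,q')\mathcal{L}(k|q)\mathcal{L}(k'|q')\big]}{\sum_{q,q'\geq 0} x(q)x(q')\big[p(q)p(q')z(q,q') + \bar{k}W(q,q')y(q,q')\big]} \\
&= \frac{\sum_{q,q'\geq 0} x(q)x(q')\big[p(q)p(q')z(q,q')J(k|q)J(k'|q') + \bar{k}W(q,q')y(q,q')\mathcal{L}(k|q)\mathcal{L}(k'|q')\big]}{\bar{k}(x,y,z)\sum_q p(q)x(q)}
\end{aligned}
\tag{A.23}
\]

with $\bar{k}(x, y, z)$ as given in (10). The marginals of $W(k, k'|x, y, z)$ are obtained trivially by summing 
(A.23) over $k'$, giving 

\[ W(k|x,y,z) = \frac{\sum_{q,q'\geq 0} x(q)x(q')\big[p(q)p(q')z(q,q')J(k|q) + \bar{k}W(q,q')y(q,q')\mathcal{L}(k|q)\big]}{\bar{k}(x,y,z)\sum_q p(q)x(q)} \tag{A.24} \]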

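Appendix A.4. Explicit expression for the factors $J(k|q)$ 

To carry out the integral in (A.21, A.22) we first write $Q(q, \Omega)$ as $Q(q, \Omega) = a(q) + b(q)qe^{-i\Omega}$, with 

\[ a(q) = \sum_{q'\geq 0} p(q')x(q')z(q,q'), \qquad b(q) = \frac{\bar{k}}{q\,p(q)}\sum_{q'\geq 0} x(q')y(q,q')W(q,q') \tag{A.25} \]

We note that, due to $\sum_{q'} W(q, q') = (q/\bar{k})p(q)$, we can be sure that $a(q) \in [0, 1]$ and $b(q) \in [0, 1]$. 
Substitution into (A.21, A.22) and integration over $\Omega$, for $q > 0$ and $k > 0$, then leads to 

\[
\begin{aligned}
J(k|q) &= e^{-a(q)}\,\frac{q!}{q^q (k-1)!}\sum_{n=0}^{k-1}\binom{k-1}{n} a^{k-1-n}(q)\,[b(q)q]^n \int_{-\pi}^{\pi}\frac{d\Omega}{2\pi}\, e^{i\Omega(q-n) + q[1-b(q)]e^{-i\Omega}} \\
&= e^{-a(q)}\sum_{n=0}^{\min\{k-1,q\}} \binom{q}{n}\, b^n(q)\,[1-b(q)]^{q-n}\, \frac{a^{k-1-n}(q)}{(k-1-n)!}
\end{aligned}
\tag{A.26}
\]

and, similarly, 

\[ \mathcal{L}(k|q) = e^{-a(q)}\sum_{n=0}^{\min\{k-1,q-1\}} \binom{q-1}{n}\, b^n(q)\,[1-b(q)]^{q-1-n}\, \frac{a^{k-1-n}(q)}{(k-1-n)!} \tag{A.27} \]

Clearly $J(k|q) \geq 0$, $\mathcal{L}(k|q) \geq 0$ for all $(k, q)$. Since the factors (A.21, A.22) also satisfy the 
normalization $\sum_{k>0} J(k|q) = 1$, $\sum_{k>0} \mathcal{L}(k|q) = 1$ for all $q > 0$, they can be interpreted as conditional 
probabilities, as suggested by our chosen notation. 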
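
The explicit factors lend themselves to direct numerical evaluation of the sampled kernel. The following is a hedged sketch rather than the authors' code: it assumes the forms $a(q) = \sum_{q'} p(q')x(q')z(q,q')$, $b(q) = \frac{\bar{k}}{qp(q)}\sum_{q'} x(q')y(q,q')W(q,q')$, $J(k|q) = e^{-a(q)}\sum_n \binom{q}{n} b^n(q)[1-b(q)]^{q-n} a^{k-1-n}(q)/(k-1-n)!$ and the analogous expression for $\mathcal{L}(k|q)$ with $q \to q-1$ in the binomial, and it truncates $k$ at a user-chosen `kmax`. All function names and the array-based representation of $p$, $W$, $x$, $y$, $z$ are illustrative assumptions.

```python
import numpy as np
from math import comb, exp, factorial

def binomial_poisson_mix(k, m, a, b):
    """e^{-a} * sum_n C(m,n) b^n (1-b)^(m-n) a^(k-1-n)/(k-1-n)!, n = 0..min(k-1, m)."""
    return exp(-a) * sum(
        comb(m, n) * b**n * (1.0 - b)**(m - n) * a**(k - 1 - n) / factorial(k - 1 - n)
        for n in range(min(k - 1, m) + 1)
    )

def kernels(a, b, kmax):
    """Tables J[k, q] and L[k, q] for k = 0..kmax and q = 0..len(a)-1 (row k = 0 is zero)."""
    qmax = len(a) - 1
    J = np.zeros((kmax + 1, qmax + 1))
    L = np.zeros((kmax + 1, qmax + 1))
    for q in range(qmax + 1):
        for k in range(1, kmax + 1):
            J[k, q] = binomial_poisson_mix(k, q, a[q], b[q])
            if q > 0:
                L[k, q] = binomial_poisson_mix(k, q - 1, a[q], b[q])
    return J, L

def sampled_kernel(p, W, x, y, z, kmax):
    """Sampled joint kernel W(k,k'|x,y,z) for k, k' = 0..kmax.

    p: degree distribution p(q); W: true kernel W(q,q');
    x: node sampling probabilities x(q); y, z: matrices y(q,q'), z(q,q').
    """
    q = np.arange(len(p))
    kbar = float(q @ p)
    a = z @ (p * x)                                   # a(q)
    qp = q * p
    b = np.zeros(len(p))
    np.divide(kbar * ((y * W) @ x), qp, out=b, where=qp > 0)  # b(q), set to 0 where q p(q) = 0
    J, L = kernels(a, b, kmax)
    A = np.outer(p * x, p * x) * z                    # x(q) x(q') p(q) p(q') z(q,q')
    B = kbar * np.outer(x, x) * W * y                 # x(q) x(q') kbar W(q,q') y(q,q')
    return (J @ A @ J.T + L @ B @ L.T) / (A.sum() + B.sum())
```

For a degree distribution with finite support, choosing `kmax` well above the largest degree keeps the truncation error of the Poisson-like tail negligible, so that the returned matrix sums to one up to rounding.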

Appendix A.5. Tests 

To test our expression (A.23) we set $x(k) = x$, $y(k, k') = y$ and $z(k, k') = z$, and try to recover from 
(A.24), via the relation $p(k|x,y,z) = \bar{k}(x,y,z)\,W(k|x,y,z)/k$, our earlier results on the degree 
distribution for unbiased sampling. We now find $a(q) = xz$, $b(q) = xy$, and $\bar{k}(x, y, z) = x(z + \bar{k}y)$, 
which implies that 

\[ J(k|q) = e^{-xz}\sum_{n=0}^{\min\{q,k-1\}} \binom{q}{n} (xy)^n (1-xy)^{q-n}\, \frac{(xz)^{k-1-n}}{(k-1-n)!} \tag{A.28} \]

and 

\[ \mathcal{L}(k|q) = J(k|q-1) \tag{A.29} \]

Let us inspect the following cases: 

• Perfect sampling, i.e. $x = y = 1$ and $z = 0$. 

Now there should be no difference between the kernel $W(k, k')$ and the observed kernel 
$W(k, k'|1, 1, 0)$ of the sample. Here we see that (A.19) simplifies to $Q(q, \Omega) = qe^{-i\Omega}$; hence 
$a(q) = 0$ and $b(q) = 1$, leading to $\mathcal{L}(k|q) = \delta_{q,k}$ and therefore to the correct identity 
$W(k, k'|x, y, z) = W(k, k')$. 

• Unbiased node and/or link undersampling, i.e. $xy < 1$ and $z = 0$. 

Now we have $\bar{k}(x, y, 0) = \bar{k}xy$ and 

\[ J(k|q) = \binom{q}{k-1} (xy)^{k-1} (1-xy)^{q-k+1}\, I(q \geq k-1) \tag{A.30} \]

which gives 

\[ W(k|x,y,0) = \frac{1}{\bar{k}}\sum_{k'\geq k} p(k')\,k' \binom{k'-1}{k-1} (xy)^{k-1} (1-xy)^{k'-k} \tag{A.31} \]

and therefore we recover the correct expression 

\[ p(k|x,y,0) = \frac{\bar{k}(x,y,0)}{k}\, W(k|x,y,0) = (xy)^k \sum_{k'\geq k} p(k') \binom{k'}{k} (1-xy)^{k'-k} \tag{A.32} \]

• Unbiased bond oversampling, i.e. $x = y = 1$ and $z > 0$. 

Now $\bar{k}(1, 1, z) = \bar{k} + z$ and $J(k|q) = e^{-z} z^{k-1-q}/(k-1-q)!$ for $k - 1 \geq q$ (and zero otherwise), which results in 

\[
\begin{aligned}
p(k|1,1,z) = \frac{\bar{k}(1,1,z)}{k}\, W(k|1,1,z) &= \frac{1}{k}\Big\{ z\sum_q p(q)J(k|q) + \sum_q p(q)\,q\,J(k|q-1) \Big\} \\
&= e^{-z}\sum_{\ell=0}^{k} p(k-\ell)\, z^\ell/\ell!
\end{aligned}
\tag{A.33}
\]

which is indeed the correct result identified earlier. 
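
The two non-trivial limits above can also be confirmed numerically. The sketch below is illustrative only: it assumes constant sampling parameters, so that $a(q) = xz$, $b(q) = xy$ and $\mathcal{L}(k|q) = J(k|q-1)$, and it uses the relation $p(k|x,y,z) = \bar{k}(x,y,z)W(k|x,y,z)/k$, which for unbiased sampling reduces the marginal to $p(k|x,y,z) = \frac{x}{k}\big[z\sum_q p(q)J(k|q) + y\sum_q q\,p(q)J(k|q-1)\big]$. The function names and the example degree distribution are made up for the illustration.

```python
from math import comb, exp, factorial

def J(k, q, a, b):
    """J(k|q) for constant a = x*z and b = x*y; zero unless k > 0 and q >= 0."""
    if k < 1 or q < 0:
        return 0.0
    return exp(-a) * sum(
        comb(q, n) * b**n * (1.0 - b)**(q - n) * a**(k - 1 - n) / factorial(k - 1 - n)
        for n in range(min(k - 1, q) + 1)
    )

def sampled_degree_dist(k, p, x, y, z):
    """p(k|x,y,z) for k > 0 under unbiased sampling, using L(k|q) = J(k|q-1)."""
    a, b = x * z, x * y
    total = sum(
        pq * (z * J(k, q, a, b) + y * q * J(k, q - 1, a, b))
        for q, pq in enumerate(p)
    )
    return x * total / k

def binomial_thinning(k, p, r):
    """Closed form for undersampling: r^k sum_{q>=k} p(q) C(q,k) (1-r)^(q-k), with r = x*y."""
    return sum(pq * comb(q, k) * r**k * (1.0 - r)**(q - k)
               for q, pq in enumerate(p) if q >= k)

def poisson_convolution(k, p, z):
    """Closed form for oversampling: e^{-z} sum_{l=0}^{k} p(k-l) z^l / l!."""
    return exp(-z) * sum(p[k - l] * z**l / factorial(l)
                         for l in range(k + 1) if k - l < len(p))

if __name__ == "__main__":
    p = [0.1, 0.2, 0.3, 0.25, 0.15]  # hypothetical degree distribution p(0..4)
    for k in range(1, 5):
        print(k, sampled_degree_dist(k, p, 0.8, 0.5, 0.0), binomial_thinning(k, p, 0.4))
    for k in range(1, 8):
        print(k, sampled_degree_dist(k, p, 1.0, 1.0, 0.7), poisson_convolution(k, p, 0.7))
```

In both loops the two printed columns should agree to rounding accuracy, mirroring the analytical reductions to (A.32) and (A.33).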



